NSF PAR Search | NSF Public Access Repository

Comment on “Data Fission: Splitting a Single Data Point” Data Fission for Unsupervised Learning: A Discussion on Post-Clustering Inference and the Challenges of Debiasing

https://doi.org/10.1080/01621459.2024.2412191

Wang, Changhu; Ge, Xinzhou; Song, Dongyuan; Li, Jingyi Jessica (January 2025, Journal of the American Statistical Association)

Full Text Available
Response to "Neglecting normalization impact in semi-synthetic RNA-seq data simulation generates artificial false positives" and "Winsorization greatly reduces false positives by popular differential expression methods when analyzing human population samples"

https://doi.org/10.1186/s13059-024-03232-8

Ge, Xinzhou; Li, Yumei; Li, Wei; Li, Jingyi Jessica (December 2024, Genome Biology)

Abstract Two correspondences raised concerns or comments about our analyses regarding exaggerated false positives found by differential expression (DE) methods. Here, we discuss the points they raise and explain why we agree or disagree with these points. We add new analysis to confirm that the Wilcoxon rank-sum test remains the most robust method compared to the other five DE methods (DESeq2, edgeR, limma-voom, dearseq, and NOISeq) in two-condition DE analyses after considering normalization and winsorization, the data preprocessing steps discussed in the two correspondences.
Full Text Available
APIR: Aggregating Universal Proteomics Database Search Algorithms for Peptide Identification with FDR Control

https://doi.org/10.1093/gpbjnl/qzae042

Chen, Yiling Elaine; Ge, Xinzhou; Woyshner, Kyla; McDermott, MeiLu; Manousopoulou, Antigoni; Ficarro, Scott B; Marto, Jarrod A; Li, Kexin; Wang, Leo David; Li, Jingyi Jessica (April 2024, Genomics, Proteomics & Bioinformatics)
Fu, Yan (Ed.)
Abstract Advances in mass spectrometry (MS) have enabled high-throughput analysis of proteomes in biological systems. The state-of-the-art MS data analysis relies on database search algorithms to quantify proteins by identifying peptide–spectrum matches (PSMs), which convert mass spectra to peptide sequences. Different database search algorithms use distinct search strategies and thus may identify unique PSMs. However, no existing approaches can aggregate all user-specified database search algorithms with a guaranteed increase in the number of identified peptides and a control on the false discovery rate (FDR). To fill in this gap, we proposed a statistical framework, Aggregation of Peptide Identification Results (APIR), that is universally compatible with all database search algorithms. Notably, under an FDR threshold, APIR is guaranteed to identify at least as many, if not more, peptides as individual database search algorithms do. Evaluation of APIR on a complex proteomics standard dataset showed that APIR outpowers individual database search algorithms and empirically controls the FDR. Real data studies showed that APIR can identify disease-related proteins and post-translational modifications missed by some individual database search algorithms. The APIR framework is easily extendable to aggregating discoveries made by multiple algorithms in other high-throughput biomedical data analysis, e.g., differential gene expression analysis on RNA sequencing data. The APIR R package is available at https://github.com/yiling0210/APIR.
Full Text Available
Developmental isoform diversity in the human neocortex informs neuropsychiatric risk mechanisms

https://doi.org/10.1126/science.adh7688

Patowary, Ashok; Zhang, Pan; Jops, Connor; Vuong, Celine K; Ge, Xinzhou; Hou, Kangcheng; Kim, Minsoo; Gong, Naihua; Margolis, Michael; Vo, Daniel; et al (May 2024, Science)

RNA splicing is highly prevalent in the brain and has strong links to neuropsychiatric disorders; yet, the role of cell type–specific splicing and transcript-isoform diversity during human brain development has not been systematically investigated. In this work, we leveraged single-molecule long-read sequencing to deeply profile the full-length transcriptome of the germinal zone and cortical plate regions of the developing human neocortex at tissue and single-cell resolution. We identified 214,516 distinct isoforms, of which 72.6% were novel (not previously annotated in Gencode version 33), and uncovered a substantial contribution of transcript-isoform diversity—regulated by RNA binding proteins—in defining cellular identity in the developing neocortex. We leveraged this comprehensive isoform-centric gene annotation to reprioritize thousands of rare de novo risk variants and elucidate genetic risk mechanisms for neuropsychiatric disorders.
Full Text Available
Exaggerated false positives by popular differential expression methods when analyzing human population samples

https://doi.org/10.1186/s13059-022-02648-4

Li, Yumei; Ge, Xinzhou; Peng, Fanglue; Li, Wei; Li, Jingyi Jessica (March 2022, Genome Biology)

Abstract When identifying differentially expressed genes between two conditions using human population RNA-seq samples, we found by permutation analysis that two popular bioinformatics methods, DESeq2 and edgeR, have unexpectedly high false discovery rates (FDRs). Expanding the analysis to limma-voom, NOISeq, dearseq, and the Wilcoxon rank-sum test, we found that all of these methods except the Wilcoxon rank-sum test often fail to control the FDR. In particular, the actual FDRs of DESeq2 and edgeR sometimes exceed 20% when the target FDR is 5%. Based on these results, for population-level RNA-seq studies with large sample sizes, we recommend the Wilcoxon rank-sum test.
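The permutation check described in this abstract can be sketched as follows. This is a minimal illustration, not the authors' code: all function names are hypothetical, the rank-sum p-value uses a normal approximation without a tie correction to the variance, and the Benjamini-Hochberg step is an assumption, since the abstract does not name the multiple-testing procedure. The idea is that after condition labels are shuffled, no gene is truly differentially expressed, so every gene that still passes the FDR threshold is a false positive.

```python
import math
import random


def rank_sum_pvalue(x, y):
    """Two-sided Wilcoxon rank-sum p-value (normal approximation, assumed here)."""
    pooled = sorted((v, g) for g, vals in ((0, x), (1, y)) for v in vals)
    ranks = [0.0] * len(pooled)
    i = 0
    while i < len(pooled):
        # Find the run of tied values and give each the average rank.
        j = i
        while j + 1 < len(pooled) and pooled[j + 1][0] == pooled[i][0]:
            j += 1
        avg_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[k] = avg_rank
        i = j + 1
    n1, n2 = len(x), len(y)
    w = sum(r for r, (_, g) in zip(ranks, pooled) if g == 0)
    mean = n1 * (n1 + n2 + 1) / 2
    sd = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    z = (w - mean) / sd
    return math.erfc(abs(z) / math.sqrt(2))


def benjamini_hochberg(pvalues, alpha=0.05):
    """Return indices rejected by the Benjamini-Hochberg step-up rule."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    cutoff = 0
    for rank, idx in enumerate(order, start=1):
        if pvalues[idx] <= alpha * rank / m:
            cutoff = rank
    return sorted(order[:cutoff])


def discoveries_after_permutation(matrix, labels, alpha=0.05, seed=0):
    """Shuffle condition labels (0/1), test each gene (row), and return
    the genes still called significant; all of them are false positives."""
    rng = random.Random(seed)
    shuffled = list(labels)
    rng.shuffle(shuffled)
    pvalues = []
    for row in matrix:
        x = [v for v, lab in zip(row, shuffled) if lab == 0]
        y = [v for v, lab in zip(row, shuffled) if lab == 1]
        pvalues.append(rank_sum_pvalue(x, y))
    return benjamini_hochberg(pvalues, alpha)
```

Running `discoveries_after_permutation` over many seeds and counting how often it returns any gene gives a rough sense of whether a method's calls on permuted data stay near the target level. A real analysis would use an established implementation of each test rather than this approximation.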
Clipper: p-value-free FDR control on high-throughput data from two conditions

https://doi.org/10.1186/s13059-021-02506-9

Ge, Xinzhou; Chen, Yiling Elaine; Song, Dongyuan; McDermott, MeiLu; Woyshner, Kyla; Manousopoulou, Antigoni; Wang, Ning; Li, Wei; Wang, Leo D; Li, Jingyi Jessica (October 2021, Genome Biology)

Abstract High-throughput biological data analysis commonly involves identifying features such as genes, genomic regions, and proteins, whose values differ between two conditions, from numerous features measured simultaneously. The most widely used criterion to ensure the analysis reliability is the false discovery rate (FDR), which is primarily controlled based on p-values. However, obtaining valid p-values relies on either reasonable assumptions of data distribution or large numbers of replicates under both conditions. Clipper is a general statistical framework for FDR control without relying on p-values or specific data distributions. Clipper outperforms existing methods for a broad range of applications in high-throughput data analysis.
DORGE: Discovery of Oncogenes and tumoR suppressor genes using Genetic and Epigenetic features

https://doi.org/10.1126/sciadv.aba6784

Lyu, Jie; Li, Jingyi Jessica; Su, Jianzhong; Peng, Fanglue; Chen, Yiling Elaine; Ge, Xinzhou; Li, Wei (November 2020, Science Advances)
Data-driven discovery of cancer driver genes, including tumor suppressor genes (TSGs) and oncogenes (OGs), is imperative for cancer prevention, diagnosis, and treatment. Although epigenetic alterations are important for tumor initiation and progression, most known driver genes were identified based on genetic alterations alone. Here, we developed an algorithm, DORGE (Discovery of Oncogenes and tumor suppressoR genes using Genetic and Epigenetic features), to identify TSGs and OGs by integrating comprehensive genetic and epigenetic data. DORGE identified histone modifications as strong predictors for TSGs, and it found missense mutations, super enhancers, and methylation differences as strong predictors for OGs. We extensively validated DORGE-predicted cancer driver genes using independent functional genomics data. We also found that DORGE-predicted dual-functional genes (both TSGs and OGs) are enriched at hubs in protein-protein interaction and drug-gene networks. Overall, our study has deepened the understanding of epigenetic mechanisms in tumorigenesis and revealed previously undetected cancer driver genes.
Full Text Available